Abstract

We address the issue of variable selection in the regression model with very high ambient dimen- 
sion, i.e., when the number of covariates is very large. The main focus is on the situation where the 
number of relevant covariates, called intrinsic dimension, is much smaller than the ambient dimen- 
sion. Without assuming any parametric form of the underlying regression function, we get tight con- 
ditions making it possible to consistently estimate the set of relevant variables. These conditions 
relate the intrinsic dimension to the ambient dimension and to the sample size. The procedure that 
is provably consistent under these tight conditions is simple and is based on comparing the empirical 
Fourier coefficients with an appropriately chosen threshold value. 



1 Introduction 



Real-world data such as those obtained from neuroscience, chemometrics, data mining, or sensor-rich
environments are often extremely high-dimensional, severely underconstrained (few data samples
compared to the dimensionality of the data), and interspersed with a large number of irrelevant or
redundant features. Furthermore, in most situations the data is contaminated by noise, making it even
more difficult to retrieve useful information from the data. Relevant variable selection is a compelling
approach for addressing statistical issues in the scenario of high-dimensional and noisy data with small
sample size. Starting from Mallows (1973), Akaike (1973), Schwarz (1978), who introduced respectively
the famous criteria $C_p$, AIC and BIC, the problem of variable selection has been extensively studied in
the statistical and machine learning literature both from the theoretical and algorithmic viewpoints.
It appears, however, that the theoretical limits of performing variable selection in the context of
nonparametric regression are still poorly understood, especially in the case where the ambient dimension
of covariates, denoted by $d$, is much larger than the sample size $n$. The purpose of the present work
is to explore this setting under the assumption that the number of relevant covariates, hereafter called
intrinsic dimension and denoted by $d^*$, may grow with the sample size but remains much smaller than
the ambient dimension $d$.

In the important particular case of linear regression, the latter scenario has been the subject of a
number of recent studies. Many of them rely on $\ell_1$-norm penalization (as for instance in Tibshirani
(1996), Zhao and Yu (2006), Meinshausen and Bühlmann (2010)) and constitute an attractive alternative
to iterative variable selection procedures proposed by Alquier (2008), Zhang (2009), Ting et al. (2010)
and to marginal regression or correlation screening explored in Wasserman and Roeder (2009), Fan et al.
(2009). Promising results for feature selection are also obtained by minimax concave penalties in Zhang
(2010), by Bayesian approach in Scott and Berger (2010) and by higher criticism in Donoho and Jin
(2009). Extensions to other settings including logistic regression, generalized linear model and Ising
model have been carried out in Bunea and Barbu (2009), Ravikumar et al. (2010), Fan et al. (2009),
respectively. Variable selection in the context of groups of variables with disjoint or overlapping groups
has been studied by Jenatton et al. (2009), Lounici et al. (2010), Obozinski et al. (2011). Hierarchical
procedures for selection of relevant covariates have been proposed by Bach (2009), Bickel et al. (2010) and
Zhao et al. (2009).

It is now well understood that in high-dimensional linear regression, if the Gram matrix satisfies
some variant of the irrepresentable condition, then consistent estimation of the pattern of relevant
variables, also called the sparsity pattern, is possible under the condition $d^*\log(d/d^*) = o(n)$ as
$n\to\infty$. Furthermore, it is well known that if $(d^*\log(d/d^*))/n$ remains bounded from below by some
positive constant when $n\to\infty$, then it is impossible to consistently recover the sparsity pattern. Thus, a
tight condition exists that describes in an exhaustive manner the interplay between the quantities $d^*$,
$d$ and $n$ that guarantees the existence of consistent estimators. The situation is very different in the
case of non-linear regression, since, to our knowledge, there is no result providing tight conditions for
consistent estimation of the sparsity pattern.

The papers Lafferty and Wasserman (2008) and Bertin and Lecué (2008), closely related to the present
work, consider the problem of variable selection in the nonparametric Gaussian regression model. They
prove the consistency of the proposed procedures under some assumptions that, in the light of the
present work, turn out to be suboptimal. More precisely, in Lafferty and Wasserman (2008), the unknown
regression function is assumed to be four times continuously differentiable with bounded derivatives.
The algorithm they propose, termed Rodeo, is a greedy procedure performing simultaneously
local bandwidth choice and variable selection. Under the assumption that the density of the sampling
design is continuously differentiable and strictly positive, Rodeo is shown to converge when the ambient
dimension $d$ is $O(\log n/\log\log n)$ while the intrinsic dimension $d^*$ does not increase with $n$. On the
other hand, Bertin and Lecué (2008) propose a procedure based on the $\ell_1$-penalization of local
polynomial estimators and prove its consistency when $d^* = O(1)$ but $d$ is allowed to be as large as $\log n$, up
to a multiplicative constant. They also have a weaker assumption on the regression function, which is
merely assumed to belong to the Hölder class with smoothness $\beta > 1$.

This brief review of the literature reveals that there is an important gap in consistency conditions
for the linear regression and for the non-linear one. For instance, if the intrinsic dimension $d^*$ is fixed,
then the condition guaranteeing consistent estimation of the sparsity pattern is $(\log d)/n\to 0$ in linear
regression whereas it is $d = O(\log n)$ in the nonparametric case. While it is undeniable that
nonparametric regression is much more complex than the linear one, it is however not easy to find a
justification for such an important gap between the two conditions. The situation is even worse in the case
where $d^*\to\infty$. In fact, for the linear model with at most polynomially increasing ambient dimension
$d = O(n^C)$, it is possible to estimate the sparsity pattern for intrinsic dimensions $d^*$ as large as $n^{1-\epsilon}$,
for some $\epsilon > 0$. In other words, the sparsity index can be almost on the same order as the sample size.
In contrast, in nonparametric regression, there is no procedure that is proved to converge to the true
sparsity pattern when both $n$ and $d^*$ tend to infinity, even if $d^*$ grows extremely slowly.

In the present work, we fill this gap by introducing a simple variable selection procedure that selects
the relevant variables by comparing some well chosen empirical Fourier coefficients to a prescribed
significance level. Consistency of this procedure is established under some conditions on the triplet
$(d^*, d, n)$ and the tightness of these conditions is proved. The main take-away messages deduced from
our results are the following:

• When the number of relevant covariates $d^*$ is fixed and the sample size $n$ tends to infinity, there
exist positive real numbers $c_*$ and $c^*$ such that (a) if $(\log d)/n\le c_*$ the estimator proposed in
Section 3 is consistent and (b) no estimator of the sparsity pattern may be consistent if $(\log d)/n\ge c^*$.

• When the number of relevant covariates $d^*$ tends to infinity with $n\to\infty$, then there exist real
numbers $\underline{c}_i$ and $\bar c_i$, $i = 1,\ldots,4$, such that $\underline{c}_i > 0$, $\bar c_i > 0$ for $i = 1,2,3$ and (a) if
$\underline{c}_1 d^* + \underline{c}_2\log d^* + \underline{c}_3\log\log d - \log n < \underline{c}_4$ the estimator proposed in Section 3 is consistent and
(b) no estimator of the sparsity pattern may be consistent if $\bar c_1 d^* + \bar c_2\log d^* + \bar c_3\log\log d - \log n > \bar c_4$.

• In particular, if $d$ grows not faster than a polynomial in $n$, then there exist positive real numbers
$c_0$ and $c^0$ such that (a) if $d^*\le c_0\log n$ the estimator proposed in Section 3 is consistent and (b) no
estimator of the sparsity pattern may be consistent if $d^*\ge c^0\log n$.

Very surprisingly, the derivation of these results required us to apply some tools from complex
analysis, such as the Jacobi $\theta$-function and the saddle point method, in order to evaluate the number
of lattice points lying in a ball of a Euclidean space with increasing dimension.

The rest of the paper is organized as follows. The notation and assumptions necessary for stating
our main results are presented in Section 2. In Section 3, an estimator of the set of relevant covariates
is introduced and its consistency is established. The principal condition required in the consistency
result involves the number of lattice points in a ball of a high-dimensional Euclidean space. An
asymptotic equivalent for this number is obtained in Section 4 via the Jacobi $\theta$-function and the saddle point
method. Results on impossibility of consistent estimation of the sparsity pattern are derived in
Section 5, while the relation between consistency and inconsistency results is discussed in Section 6. The
technical parts of the proofs are postponed to the Appendix.



2 Notation and assumptions 

We assume that $n$ independent and identically distributed pairs of input-output variables $(X_i, Y_i)$,
$i = 1,\ldots,n$, are observed that obey the regression model

$$Y_i = f(X_i) + \sigma\varepsilon_i,\qquad i = 1,\ldots,n.$$

The input variables $X_1,\ldots,X_n$ are assumed to take values in $\mathbb{R}^d$ while the output variables $Y_1,\ldots,Y_n$ are
scalar. As usual, the noise $\varepsilon_1,\ldots,\varepsilon_n$ is such that $E[\varepsilon_i|X_i] = 0$, $i = 1,\ldots,n$; some additional conditions will
be imposed later. Without requiring $f$ to be of a special parametric form, we aim at recovering the
set $J\subset\{1,\ldots,d\}$ of its relevant covariates.

It is clear that the estimation of $J$ cannot be accomplished without imposing some further
assumptions on $f$ and the distribution $P_X$ of the input variables. Roughly speaking, we will assume that $f$ is
differentiable with a square integrable gradient and that $P_X$ admits a density which is bounded from
below. More precisely, let $g$ denote the density of $P_X$ w.r.t. the Lebesgue measure.

[C1] We assume that $g(x) = 0$ for any $x\notin[0,1]^d$ and that $g(x)\ge g_{\min}$ for any $x\in[0,1]^d$.

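To describe the smoothness assumption imposed on $f$, let us introduce the Fourier basis

$$\varphi_k(x) = \begin{cases} 1, & k = 0,\\ \sqrt{2}\cos(2\pi k\cdot x), & k\in(\mathbb{Z}^d)_+,\\ \sqrt{2}\sin(2\pi k\cdot x), & -k\in(\mathbb{Z}^d)_+,\end{cases} \tag{1}$$

where $(\mathbb{Z}^d)_+$ denotes the set of all $k\in\mathbb{Z}^d\setminus\{0\}$ such that the first nonzero element of $k$ is positive and
$k\cdot x$ stands for the usual inner product in $\mathbb{R}^d$. In what follows, we use the notation $\langle\cdot,\cdot\rangle$ for denoting
the scalar product in $L^2([0,1]^d;\mathbb{R})$, that is $\langle h,\tilde h\rangle = \int_{[0,1]^d} h(x)\tilde h(x)\,dx$ for every $h,\tilde h\in L^2([0,1]^d;\mathbb{R})$. Using
this orthonormal Fourier basis, we define

$$\Sigma_L = \Big\{f : \sum_{k\in\mathbb{Z}^d} k_j^2\langle f,\varphi_k\rangle^2\le L;\ \forall j\in\{1,\ldots,d\}\Big\}.$$

To ease notation, we set $\theta_k[f] = \langle f,\varphi_k\rangle$ for all $k\in\mathbb{Z}^d$. In addition to the smoothness, we also need
to require that the relevant covariates are sufficiently relevant for making their identification possible.
This is done by means of the following condition.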

[C2($\kappa$, $L$)] The regression function $f$ belongs to $\Sigma_L$. Furthermore, for some subset $J\subset\{1,\ldots,d\}$ of
cardinality $\le d^*$, there exists a function $\bar f:\mathbb{R}^{|J|}\to\mathbb{R}$ such that $f(x) = \bar f(x_J)$, $\forall x\in\mathbb{R}^d$, and it holds
that

$$Q_j[f] = \sum_{k\,:\,k_j\ne 0}\theta_k[f]^2\ge\kappa,\qquad\forall j\in J. \tag{2}$$

Hereafter, we will refer to $J$ as the sparsity pattern of $f$.

One easily checks that $Q_j[f] = 0$ for every $j$ that does not lie in the sparsity pattern. This provides a
characterization of the sparsity pattern as the set of indices of nonzero coefficients of the vector
$Q[f] = (Q_1[f],\ldots,Q_d[f])$.

The next assumptions imposed on the regression function and on the noise require their boundedness
in an appropriate sense. These assumptions are needed in order to prove, by means of a concentration
inequality, the closeness of the empirical coefficients to the true ones.

[C3($L_\infty$, $L_2$)] The $L^\infty([0,1]^d,\mathbb{R},P_X)$ and $L^2([0,1]^d,\mathbb{R},P_X)$ norms of the function $f$ are bounded from above
respectively by positive constants $L_\infty$ and $L_2$, i.e., $P_X(x\in[0,1]^d : |f(x)|\le L_\infty) = 1$ and $\int_{[0,1]^d} f(x)^2 g(x)\,dx\le L_2^2$.

[C4] The noise variables satisfy a.e. $E[e^{t\varepsilon_i}\,|\,X_i]\le e^{t^2/2}$ for all $t > 0$.

Remark 1. The primary aim of this work is to understand when it is possible to estimate the sparsity
pattern (with theoretical guarantees on the convergence of the estimator) and when it is impossible. The
estimator that we will define in the next section is intended to show the possibility of consistent
estimation, rather than being a practical procedure for recovering the sparsity pattern. Therefore, the estimator
will be allowed to depend on the parameters $g_{\min}$, $L$, $\kappa$ and $M$ appearing in conditions [C1-C3].



3 Consistent estimation of the set of relevant variables 

The estimator of the sparsity pattern $J$ that we are going to introduce now is based on the following
simple observation: if $j\notin J$ then $\theta_k[f] = 0$ for every $k$ such that $k_j\ne 0$. In contrast, if $j\in J$ then there
exists $k\in\mathbb{Z}^d$ with $k_j\ne 0$ such that $|\theta_k[f]| > 0$. To turn this observation into an estimator of $J$, we start
by estimating the Fourier coefficients $\theta_k[f]$ by their empirical counterparts:

$$\hat\theta_k = \frac{1}{n}\sum_{i=1}^n\frac{\varphi_k(X_i)}{g(X_i)}\,Y_i,\qquad k\in\mathbb{Z}^d.$$

Then, for every $\ell\in\mathbb{N}$ and for any $\gamma > 0$, we introduce the notation $S_{m,\ell} = \{k\in\mathbb{Z}^d : \|k\|_2\le m,\ \|k\|_0\le\ell\}$
and $N(d^*,\gamma) = \mathrm{Card}\{k\in\mathbb{Z}^{d^*} : \|k\|_2^2\le\gamma d^*\ \&\ k_1\ne 0\}$. Finally, our estimator is defined by

$$\hat J_n(m,\lambda) = \Big\{j\in\{1,\ldots,d\} : \max_{k\in S_{m,d^*}:\,k_j\ne 0}|\hat\theta_k| > \lambda\Big\}, \tag{3}$$

where $m$ and $\lambda$ are some parameters to be defined later. The notation $a\wedge b$, for two real numbers $a$ and
$b$, stands for $\min(a,b)$.
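
As a rough illustration of definition (3), the following Python sketch computes the empirical Fourier coefficients over the truncated frequency set and thresholds them. It is only a minimal sketch under stated assumptions, not an implementation taken from this paper: the helper names `phi`, `frequencies` and `select_variables` are hypothetical, the design density is passed by the user as a function `g`, covariates are indexed from 0, and the brute-force enumeration of frequencies is exponential in $d^*$, so it is meant for small toy problems only.

```python
import itertools
import math

import numpy as np


def phi(k, x):
    """Fourier basis function (1) for frequency k, evaluated at the rows of x (n x d)."""
    k = np.asarray(k)
    nonzero = np.flatnonzero(k)
    if nonzero.size == 0:
        return np.ones(x.shape[0])
    arg = 2.0 * math.pi * (x @ k)
    if k[nonzero[0]] > 0:          # k belongs to (Z^d)_+ : cosine branch
        return math.sqrt(2.0) * np.cos(arg)
    return math.sqrt(2.0) * np.sin(arg)  # -k belongs to (Z^d)_+ : sine branch


def frequencies(d, m, d_star):
    """Yield every nonzero k in Z^d with ||k||_2 <= m and ||k||_0 <= d_star."""
    r = int(math.floor(m))
    values = [v for v in range(-r, r + 1) if v != 0]
    for size in range(1, d_star + 1):
        for support in itertools.combinations(range(d), size):
            for vals in itertools.product(values, repeat=size):
                if sum(v * v for v in vals) <= m * m:
                    k = np.zeros(d, dtype=int)
                    k[list(support)] = vals
                    yield k


def select_variables(x, y, g, m, lam, d_star):
    """Return the sorted indices j for which some |theta_hat_k| with k_j != 0 exceeds lam."""
    n, d = x.shape
    weights = y / g(x)             # Y_i / g(X_i), with g assumed known
    selected = set()
    for k in frequencies(d, m, d_star):
        theta_hat = np.mean(phi(k, x) * weights)
        if abs(theta_hat) > lam:
            selected.update(np.flatnonzero(k).tolist())
    return sorted(selected)
```

For a uniform design on $[0,1]^d$ the function `g` simply returns a vector of ones. Collecting the supports of all frequencies whose coefficient exceeds the threshold is equivalent to taking, for each $j$, the maximum over $k$ with $k_j\ne 0$, which is how (3) is written.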

Theorem 1. Let conditions [C1-C4] be fulfilled with some known constants $g_{\min}$, $L$, $\kappa$ and $L_2$. Assume
furthermore that the design density $g$ and an upper estimate on the noise magnitude $\sigma$ are available. Set
$m = (2Ld^*/\kappa)^{1/2}$ and $\lambda = 4(\sigma + L_2)\big(d^*\log(6md)/(n g_{\min}^2)\big)^{1/2}$. If

$$\frac{L_\infty^2 d^*\log(6md)}{n}\le L_2^2,\qquad\text{and}\qquad\frac{128(\sigma+L_2)^2 d^* N(d^*,2L/\kappa)\log(6md)}{n g_{\min}^2}\le\kappa, \tag{4}$$

then the estimator $\hat J(m,\lambda)$ satisfies $P(\hat J(m,\lambda)\ne J)\le 3(6md)^{-d^*}$.

If we take a look at the conditions of Theorem 1 ensuring the consistency of the estimator $\hat J$, it
becomes clear that the strongest requirement is the second inequality in (4). To some extent, this
condition requires that $(d^* N(d^*,2L/\kappa)\log d)/n$ is bounded from above by some constant. To further analyze
the interplay between $d^*$, $d$ and $n$ implied by this condition, we need an equivalent to $N(d^*,2L/\kappa)$ as
the intrinsic dimension $d^*$ tends to infinity. As proved in the next section, $N(d^*,2L/\kappa)$ diverges
exponentially fast, making inequality (4) impossible for $d^*$ larger than $\log n$ up to a multiplicative constant.

It is also worth stressing that although we require the $P_X$-a.e. boundedness of $f$ by some constant $L_\infty$,
this constant is not needed for computing the estimator proposed in Theorem 1. Only constants related
to some quadratic functionals of the sequence of Fourier coefficients $\theta_k[f]$ are involved in the tuning
parameters $m$ and $\lambda$. This point might be important for designing practical estimators of $J$, since the
estimation of quadratic functionals is more realistic, see for instance Laurent and Massart (2000), than
the estimation of the sup-norm.

The result stated above also provides a level of relevance $\kappa$ for the covariates of $X$ making their
identification possible. In fact, an alternative way of reading Theorem 1 is the following: if conditions
[C1-C4] and $L_\infty^2 d^*\log(6md)\le nL_2^2$ are fulfilled, then the estimator $\hat J(m,\lambda)$, with arbitrary tuning
parameters $m$ and $\lambda$, satisfies $P(\hat J(m,\lambda)\ne J)\le 3(6md)^{-d^*}$ provided that the smallest level of relevance $\kappa$
for components $X_j$ of $X$ with $j\in J$ is not smaller than $8\lambda^2 N(d^*, m^2/d^*)$.



4 Counting lattice points in a ball 

The aim of the present section is to investigate the properties of the quantity $N(d^*, m^2/d^*)$ that is
involved in the conditions ensuring the consistency of the proposed procedure. Quite surprisingly, the
asymptotic behavior of $N(d^*, m^2/d^*)$ turns out to be related to the Jacobi $\theta$-function. In order to show
this, let us introduce some notation. For a positive number $\gamma$, we set

$$\mathcal{C}_1(d^*,\gamma) = \big\{k\in\mathbb{Z}^{d^*} : k_1^2+\ldots+k_{d^*}^2\le\gamma d^*\big\},\qquad\mathcal{C}_2(d^*,\gamma) = \big\{k\in\mathbb{Z}^{d^*} : k_1^2+\ldots+k_{d^*}^2\le\gamma d^*\ \&\ k_1 = 0\big\}$$

along with $N_1(d^*,\gamma) = \mathrm{Card}\,\mathcal{C}_1(d^*,\gamma)$ and $N_2(d^*,\gamma) = \mathrm{Card}\,\mathcal{C}_2(d^*,\gamma)$. In simple words, $N_1(d^*,\gamma)$ is the
number of (integer) lattice points lying in the $d^*$-dimensional ball with radius $(\gamma d^*)^{1/2}$ and centered
at the origin, while $N_2(d^*,\gamma)$ is the number of (integer) lattice points with the first coordinate equal to
zero and lying in the $d^*$-dimensional ball with radius $(\gamma d^*)^{1/2}$ and centered at the origin. With this
notation, the quantity $N(d^*,2L/\kappa)$ of Theorem 1 can be written as $N_1(d^*,2L/\kappa) - N_2(d^*,2L/\kappa)$.

In order to determine the asymptotic behavior of $N_1(d^*,\gamma)$ and $N_2(d^*,\gamma)$ when $d^*$ tends to infinity,
we will rely on their integral representation through Jacobi's $\theta$-function. Recall that the latter is given
by $h(z) = \sum_{r\in\mathbb{Z}} z^{r^2}$, which is well defined for any complex number $z$ belonging to the unit ball $|z| < 1$. To
briefly explain where the relation between $N_i(d^*,\gamma)$ and the $\theta$-function comes from, let us denote by $\{a_r\}$
the sequence of coefficients of the power series of $h(z)^{d^*}$, that is $h(z)^{d^*} = \sum_{r\ge 0} a_r z^r$. One easily checks
that $\forall r\in\mathbb{N}$, $a_r = \mathrm{Card}\{k\in\mathbb{Z}^{d^*} : k_1^2+\ldots+k_{d^*}^2 = r\}$. Thus, for every $\gamma$ such that $\gamma d^*$ is integer, we have
$N_1(d^*,\gamma) = \sum_{r=0}^{\gamma d^*} a_r$. As a consequence of Cauchy's theorem, we get:

$$N_1(d^*,\gamma) = \frac{1}{2\pi i}\oint\frac{h(z)^{d^*}}{z^{\gamma d^*}}\,\frac{dz}{z(1-z)},$$

where the integral is taken over any circle $|z| = w$ with $0 < w < 1$. Exploiting this representation and
applying the saddle-point method thoroughly described in Dieudonné (1968), we get the following result.
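
Before turning to asymptotics, the power-series description of $N_1$ and $N_2$ can be checked on small cases. The sketch below is illustrative only: the function names are hypothetical, and the argument `radius_sq` stands for $\gamma d^*$, which is assumed to be a nonnegative integer. It multiplies truncated copies of $h(z)$ to obtain the coefficients $a_r$ and sums them, and it includes a brute-force counter for comparison.

```python
import itertools
import math


def sum_of_squares_counts(dim, radius_sq):
    """Return a list a with a[r] = #{k in Z^dim : k_1^2 + ... + k_dim^2 = r}, 0 <= r <= radius_sq."""
    # Coefficients of h(z) = sum_{r in Z} z^{r^2}, truncated at degree radius_sq.
    h = [0] * (radius_sq + 1)
    h[0] = 1
    r = 1
    while r * r <= radius_sq:
        h[r * r] = 2
        r += 1
    a = [1] + [0] * radius_sq      # coefficients of h(z)^0
    for _ in range(dim):           # multiply by h(z), dropping degrees above radius_sq
        new = [0] * (radius_sq + 1)
        for i, a_i in enumerate(a):
            if a_i == 0:
                continue
            for j, h_j in enumerate(h):
                if h_j and i + j <= radius_sq:
                    new[i + j] += a_i * h_j
        a = new
    return a


def n1(d_star, radius_sq):
    """Number of lattice points of Z^d_star in the ball of squared radius radius_sq."""
    return sum(sum_of_squares_counts(d_star, radius_sq))


def n2(d_star, radius_sq):
    """Same count restricted to points whose first coordinate is zero."""
    return sum(sum_of_squares_counts(d_star - 1, radius_sq))


def n1_brute_force(d_star, radius_sq):
    """Direct enumeration, usable only for very small d_star."""
    r = math.isqrt(radius_sq)
    grid = range(-r, r + 1)
    return sum(1 for k in itertools.product(grid, repeat=d_star)
               if sum(v * v for v in k) <= radius_sq)
```

With these helpers, the quantity $N(d^*,\gamma)$ of Theorem 1 corresponds to `n1(d_star, radius_sq) - n2(d_star, radius_sq)`. The convolution costs a number of operations polynomial in $d^*$ and $\gamma d^*$, whereas the brute-force enumeration grows exponentially with $d^*$.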

Proposition 1. Let $\gamma > 0$ be such that $\gamma d^*$ is an integer and let $l_\gamma(z) = \log h(z) - \gamma\log z$.

1. There is a unique solution $z_\gamma$ in $(0,1)$ to the equation $l_\gamma'(z) = 0$. Furthermore, the function $\gamma\mapsto z_\gamma$ is
increasing and $l_\gamma''(z_\gamma) > 0$.

2. The following equivalences hold true:

$$N_1(d^*,\gamma) = \Big(\frac{h(z_\gamma)}{z_\gamma^{\gamma}}\Big)^{d^*}\frac{1+o(1)}{z_\gamma(1-z_\gamma)\big(2 l_\gamma''(z_\gamma)\pi d^*\big)^{1/2}},$$

$$N_2(d^*,\gamma) = \Big(\frac{h(z_\gamma)}{z_\gamma^{\gamma}}\Big)^{d^*}\frac{1+o(1)}{h(z_\gamma)z_\gamma(1-z_\gamma)\big(2 l_\gamma''(z_\gamma)\pi d^*\big)^{1/2}},$$

as $d^*$ tends to infinity.

In the sequel, it will be useful to remark that the second part of Proposition 1 yields

$$\log\big(N_1(d^*,\gamma) - N_2(d^*,\gamma)\big) = d^* l_\gamma(z_\gamma) - \frac{1}{2}\log d^* - \log\Big\{\frac{h(z_\gamma)z_\gamma(1-z_\gamma)\big(2 l_\gamma''(z_\gamma)\pi\big)^{1/2}}{h(z_\gamma)-1}\Big\} + o(1). \tag{5}$$

In order to get an idea of how the terms $z_\gamma$ and $l_\gamma(z_\gamma)$ depend on $\gamma$, we depicted in Figure 1 the plots of
these quantities as functions of $\gamma > 0$.

Figure 1: The plots of the mappings $\gamma\mapsto z_\gamma$ and $\gamma\mapsto l_\gamma(z_\gamma)$.
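
The quantities appearing in Proposition 1 are easy to evaluate numerically. The following sketch is a hedged illustration rather than part of the paper: the function names, the truncation of the series at `terms` summands and the bisection bracket `[1e-9, 0.999]` are assumptions that are adequate for moderate values of $\gamma$ only. It solves $l_\gamma'(z) = 0$, written as $z h'(z)/h(z) = \gamma$, and then evaluates the leading terms of $\log N_1$ and of formula (5).

```python
import math


def theta_series(z, terms=400):
    """Return h(z), h'(z), h''(z) for h(z) = sum_{r in Z} z^{r^2}, with 0 < z < 1 (truncated sums)."""
    h, h1, h2 = 1.0, 0.0, 0.0
    for r in range(1, terms):
        p = r * r
        h += 2.0 * z ** p
        h1 += 2.0 * p * z ** (p - 1)
        h2 += 2.0 * p * (p - 1) * z ** (p - 2)
    return h, h1, h2


def saddle_point(gamma, lo=1e-9, hi=0.999, iters=200):
    """Solve z * h'(z) / h(z) = gamma by bisection; the left-hand side increases with z."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        h, h1, _unused = theta_series(mid)
        if mid * h1 / h < gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def log_n1_asymptotic(d_star, gamma):
    """Leading terms of log N_1(d_star, gamma) from the second part of Proposition 1."""
    z = saddle_point(gamma)
    h, h1, h2 = theta_series(z)
    l_value = math.log(h) - gamma * math.log(z)
    l_second = h2 / h - (h1 / h) ** 2 + gamma / z ** 2
    return (d_star * l_value - math.log(z * (1.0 - z))
            - 0.5 * math.log(2.0 * math.pi * d_star * l_second))


def log_n_asymptotic(d_star, gamma):
    """Leading terms of log(N_1 - N_2), i.e. formula (5) without the o(1) remainder."""
    z = saddle_point(gamma)
    h, _h1, _h2 = theta_series(z)
    return log_n1_asymptotic(d_star, gamma) + math.log((h - 1.0) / h)
```

Evaluating `saddle_point` and the value of $l_\gamma$ at the returned point over a grid of $\gamma$ gives curves of the kind described in the caption of Figure 1. Since the equivalences hold only up to a factor $1+o(1)$, the returned values should be read as approximations whose quality improves as $d^*$ grows.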



5 Tightness of the assumptions

In this section, we assume that the errors $\varepsilon_i$ are i.i.d. Gaussian with zero mean and variance 1 and we
focus our attention on the functional class $\Sigma(\kappa,L)$ of all functions satisfying assumption [C2($\kappa$, $L$)]. In
order to avoid irrelevant technicalities and to better convey the main results, we assume that $\kappa = 1$ and
denote $\Sigma_L = \Sigma(1,L)$. Furthermore, we will assume that the design $X_1,\ldots,X_n$ is fixed and satisfies



1 " 

-Y,ipkiXi)ip^iXi] 



< 



! = 1 



Ni[d*,Lf 



(6) 



for all distinct $k, k'\in S_{(d^*L)^{1/2},d^*}\subset\mathbb{Z}^d$. The goal in this section is to provide conditions under which the
consistent estimation of the sparsity support is impossible, that is, there exist a positive constant $c > 0$
and an integer $n_0\in\mathbb{N}$ such that, if $n\ge n_0$,

$$\inf_{\hat J}\sup_{f\in\Sigma_L} P_f(\hat J\ne J_f)\ge c,$$

where the inf is over all possible estimators of $J_f$. To lower bound the LHS of the last inequality, we
introduce a set of $M+1$ probability distributions $\mu_0,\ldots,\mu_M$ on $\Sigma_L$ and use the fact that

$$\inf_{\hat J}\sup_{f\in\Sigma_L} P_f(\hat J\ne J_f)\ge\inf_{\hat J}\frac{1}{M+1}\sum_{\ell=0}^{M}\int_{\Sigma_L} P_f(\hat J\ne J_f)\,\mu_\ell(df). \tag{7}$$

These measures $\mu_\ell$ will be chosen in such a way that for each $\ell\ge 1$ there is a set $J_\ell$ of cardinality $d^*$ such
that $\mu_\ell(J_f = J_\ell) = 1$ and all the sets $J_1,\ldots,J_M$ are distinct. The measure $\mu_0$ is the Dirac measure at 0.
Considering these $\mu_\ell$'s as "prior" probability measures on $\Sigma_L$ and defining the corresponding "posterior"
probability measures $P_0, P_1,\ldots,P_M$ by

$$P_\ell(A) = \int_{\Sigma_L} P_f(A)\,\mu_\ell(df),\qquad\text{for every measurable set } A\subset\mathbb{R}^n,$$

we can write the inequality (7) as

$$\inf_{\hat J}\sup_{f\in\Sigma_L} P_f(\hat J\ne J_f)\ge\inf_{\psi}\frac{1}{M+1}\sum_{\ell=0}^{M} P_\ell(\psi\ne\ell), \tag{8}$$

where the inf is taken over all random variables $\psi$ taking values in $\{0,\ldots,M\}$. The latter inf will be
controlled using a suitable version of the Fano lemma, see Fano (1961). In what follows, we denote
by $\mathcal{K}(P,Q)$ the Kullback-Leibler divergence between two probability measures $P$ and $Q$ defined on the
same probability space.

Lemma 1 (Corollary 2.6 of Tsybakov (2009)). Let $(\mathcal{X},\mathcal{A})$ be a measurable space and let $P_0,\ldots,P_M$ be
probability measures on $(\mathcal{X},\mathcal{A})$. Let us set $\bar p_{e,M} = \inf_{\psi}(M+1)^{-1}\sum_{\ell=0}^{M} P_\ell(\psi\ne\ell)$, where the inf is taken
over all measurable functions $\psi:\mathcal{X}\to\{0,\ldots,M\}$. If for some $0 < \alpha < 1$

$$\frac{1}{M+1}\sum_{\ell=0}^{M}\mathcal{K}(P_\ell,P_0)\le\alpha\log M,$$

then

$$\bar p_{e,M}\ge\frac{\log(M+1)-\log 2}{\log M}-\alpha.$$

It follows from this lemma that one can deduce a lower bound on $\bar p_{e,M}$, which is the quantity we are
interested in, from an upper bound on the average Kullback-Leibler divergence between the measures
$P_\ell$ and $P_0$. This roughly means that the measures $\mu_\ell$ should not be very far from $\mu_0$ but the probability
measures $\mu_\ell$ should be very different one from another in terms of the sparsity pattern of a function $f$
randomly drawn according to $\mu_\ell$. This property is ensured by the following result.

Lemma 2. Suppose $\mu_0 = \delta_0$, the Dirac measure at $0\in\Sigma_L$. Let $S$ be a subset of $\mathbb{Z}^d$ of cardinality $|S|$ and $A$
be a constant. Define $\mu_S$ as a discrete measure supported on the finite set of functions $\{f_\omega = \sum_{k\in S} A\omega_k\varphi_k :
\omega\in\{\pm 1\}^S\}$ such that $\mu_S(f = f_\omega) = 2^{-|S|}$ for every $\omega\in\{\pm 1\}^S$, i.e., the $\omega_k$'s are i.i.d. Rademacher random
variables under $\mu_S$. If for some $\epsilon > 0$, the condition



! = 1 

is fulfilled, then 

^(Pi,Po)<lo: 



1 " 
n -rr' 



\S\€ 



^dPo 

These evaluations lead to the following theorem, which tells us that the conditions to which we have
resorted for proving the consistency in Section 3 are nearly optimal.

Theorem 2. Let the design $X_1,\ldots,X_n\in[0,1]^d$ be deterministic and satisfy (6). Let $\gamma^*$ be the largest real
number such that $d^*\gamma^*$ is integer and $L\ge\gamma^*(1+1/(2z_{\gamma^*}))$. If for some positive number $\alpha < (\log 3-\log 2)/\log 3$

{Ny{d*,r*)-N2{d*,r*)Y^og{i) a 

— > — , f91 

n^Ni{d*,y*) "5 

then there exist a positive constant $c > 0$ and a $d_0\in\mathbb{N}$ such that, if $d^*\ge d_0$,

$$\inf_{\hat J}\sup_{f\in\Sigma_L} P_f(\hat J\ne J_f)\ge c.$$

Proof. We apply the Fano lemma with $M = \binom{d}{d^*}$. We choose $\mu_0,\ldots,\mu_M$ as follows. $\mu_0$ is the Dirac
measure $\delta_0$, $\mu_1$ is defined as in Lemma 2 with $S = \mathcal{C}_1(d^*,\gamma^*)$ and $A = [N_1(d^*,\gamma^*)-N_2(d^*,\gamma^*)]^{-1/2}$. The
measures $\mu_2,\ldots,\mu_M$ are defined similarly and correspond to the $M-1$ remaining sparsity patterns of
cardinality $d^*$.

In view of inequality (8) and Lemma 1, it suffices to show that the measures $\mu_\ell$ satisfy $\mu_\ell(\Sigma_L) = 1$ and
$\sum_{\ell=0}^{M}\mathcal{K}(P_\ell,P_0)\le(M+1)\alpha\log M$. Combining Lemma 2 with $|S| = N_1(d^*,\gamma^*)$ and condition (6), one easily
checks that equation (9) implies the desired bound on $\sum_{\ell=0}^{M}\mathcal{K}(P_\ell,P_0)$.

Let us show now that $\mu_1(\Sigma_L) = 1$. By symmetry, this will imply that $\mu_\ell(\Sigma_L) = 1$ for every $\ell$. Since $\mu_1$ is
supported by the set $\{f_\omega : \omega\in\{\pm 1\}^{\mathcal{C}_1(d^*,\gamma^*)}\}$, it is clear that

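$$\sum_{k\,:\,k_1\ne 0}\theta_k[f_\omega]^2 = A^2\big[N_1(d^*,\gamma^*)-N_2(d^*,\gamma^*)\big] = 1$$

and, for every $j = 1,\ldots,d^*$,

$$\sum_{k} k_j^2\,\theta_k[f_\omega]^2 = A^2\sum_{k\in\mathcal{C}_1(d^*,\gamma^*)} k_j^2\le A^2\gamma^* N_1(d^*,\gamma^*),$$

where the inequality uses that, by symmetry, $\sum_{k\in\mathcal{C}_1(d^*,\gamma^*)} k_j^2 = (d^*)^{-1}\sum_{k\in\mathcal{C}_1(d^*,\gamma^*)}\|k\|_2^2$.
By virtue of Proposition 1, as $d^*$ tends to infinity, $N_1(d^*,\gamma^*)/N_2(d^*,\gamma^*)$ is asymptotically equivalent to
$h(z_{\gamma^*}) > 1+2z_{\gamma^*}$. Hence, for $d^*$ large enough,

$$A^2 N_1(d^*,\gamma^*) = \frac{N_1(d^*,\gamma^*)}{N_1(d^*,\gamma^*)-N_2(d^*,\gamma^*)}\le\frac{1}{2z_{\gamma^*}}+1.$$

As a consequence, for every $j = 1,\ldots,d^*$,

$$\sum_{k} k_j^2\,\theta_k[f_\omega]^2\le\gamma^*\Big(1+\frac{1}{2z_{\gamma^*}}\Big)\le L,$$

where the last inequality follows from the definition of $\gamma^*$. □

Note that Theorem 2 is concerned with the case where the intrinsic dimension is not too small, which
is the most interesting case in the present context. However, a much simpler result can be established
showing that the conditions of Theorem 1 are tight in the case of fixed intrinsic dimension as well.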

Proposition 2. Let the design $X_1,\ldots,X_n\in[0,1]^d$ be either deterministic or random. If for some positive
$\alpha < (\log 3-\log 2)/\log 3$, the inequality

$$\frac{d^*(\log d-\log d^*)}{n}\ge\alpha^{-1}$$

holds true, then there is a constant $c > 0$ such that $\inf_{\hat J_n}\sup_{f\in\Sigma_L} P_f(\hat J_n\ne J_f)\ge c$.



6 Discussion 

The results proved in previous sections almost exhaustively answer the questions on the existence of
consistent estimators of the sparsity pattern in the problem of nonparametric regression. In fact, as
far as only rates of convergence are of interest, the result obtained in Theorem 1 is shown in Section 5
to be unimprovable. Thus only the problem of finding sharp constants remains open. To make these
statements more precise, let us consider the simplified set-up $\sigma = \kappa = 1$ and define the following two
regimes:

• The regime of fixed sparsity, i.e., when the sample size $n$ and the ambient dimension $d$ tend to
infinity but the intrinsic dimension $d^*$ remains constant or bounded.

• The regime of increasing sparsity, i.e., when the intrinsic dimension $d^*$ tends to infinity along with
the sample size $n$ and the ambient dimension $d$. For simplicity, we will assume that $d^* = O(d^{1-\epsilon})$
for some $\epsilon > 0$.

In the fixed sparsity regime, in view of Theorem 1, consistent estimation of the sparsity pattern can be
achieved using the estimator $\hat J$ as soon as $(\log d)/n\le c_*$, where $c_*$ is the constant defined by

$$c_* = \min\Big(\frac{L_2^2}{2d^*L_\infty^2},\ \frac{g_{\min}^2}{2^8(1+L_2)^2 d^* N(d^*,2L)}\Big).$$

This follows from the fact that the tuning parameter $m$ is fixed and that the probability of the error,
bounded by $3(6md)^{-d^*}$, tends to zero as $d\to\infty$. On the other hand, by virtue of Proposition 2, consistent
estimation of the sparsity pattern is impossible if $(\log d)/n\ge c^*$, where $c^* = 2\log 3/(d^*\log(3/2))$. Thus,
up to the multiplicative constants $c_*$ and $c^*$ (which are clearly not sharp), the result of Theorem 1 cannot
be improved.

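In the regime of increasing sparsity, the second inequality in (4) is the most stringent one. Taking
the logarithm of both sides and using formula (5) for $N(d^*,2L) = N_1(d^*,2L)-N_2(d^*,2L)$, we see that
consistent estimation of $J$ is possible when

$$\underline{c}_1 d^* + \frac{1}{2}\log d^* + \log\log d - \log n\le\underline{c}_2, \tag{10}$$

with $\underline{c}_1 = l_{2L}(z_{2L})$ and $\underline{c}_2 = 2\big(\log(g_{\min})-\log(17(\sigma+L_2))\big) + \log\Big\{\frac{h(z_{2L})z_{2L}(1-z_{2L})(2 l_{2L}''(z_{2L})\pi)^{1/2}}{h(z_{2L})-1}\Big\}$. On the other
hand, by virtue of (5),

$$\log\Big\{\frac{(N_1(d^*,\gamma^*)-N_2(d^*,\gamma^*))^2}{N_1(d^*,\gamma^*)}\Big\} = d^* l_{\gamma^*}(z_{\gamma^*}) - \frac{1}{2}\log d^* - \log\Big\{\frac{h(z_{\gamma^*})^2 z_{\gamma^*}(1-z_{\gamma^*})(2 l_{\gamma^*}''(z_{\gamma^*})\pi)^{1/2}}{(h(z_{\gamma^*})-1)^2}\Big\} + o(1).$$

Therefore, Theorem 2 yields that it is impossible to consistently estimate $J$ if

$$\bar c_1 d^* + \frac{1}{2}\log d^* + \log\log d - 2\log n\ge\bar c_2, \tag{11}$$

with $\bar c_1 = l_{\gamma^*}(z_{\gamma^*})$ and $\bar c_2 = \log\Big\{\frac{h(z_{\gamma^*})^2 z_{\gamma^*}(1-z_{\gamma^*})(2 l_{\gamma^*}''(z_{\gamma^*})\pi)^{1/2}}{(h(z_{\gamma^*})-1)^2}\Big\} + \log\log(3/2) - \log 5 - \log\log 3$. A very
simple consequence of inequalities (10) and (11) is that the consistent recovery of the sparsity pattern is
possible under the condition $d^*/\log n\to 0$ and impossible for $d^*/\log n\to\infty$ as $n\to\infty$, provided that
$\log\log d = o(\log n)$.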

Let us stress now that, throughout this work, we have deliberately avoided any discussion of the
computational aspects of variable selection in nonparametric regression. The goal in this paper was to
investigate the possibility of consistent recovery without paying attention to the complexity of the
selection procedure. This led to some conditions that could be considered a benchmark for assessing the
properties of sparsity pattern estimators. As for the estimator proposed in Section 3, it is worth noting
that its computational complexity is not always prohibitively large. A recommended strategy is to
compute the coefficients $\hat\theta_k$ in a stepwise manner; at each step $K = 1,2,\ldots,d^*$ only the coefficients $\hat\theta_k$ with
$\|k\|_0 = K$ need to be computed and compared with the threshold. If some $\hat\theta_k$ exceeds the threshold,
then all the covariates $X_j$ corresponding to nonzero coordinates of $k$ are considered as relevant. We
can stop this computation as soon as the number of covariates classified as relevant attains $d^*$. While
the worst-case complexity of this procedure is exponential, there are many functions $f$ for which the
complexity of the procedure will be polynomial in $d$. For example, this is the case for additive models
in which $f(x) = f_1(x_{i_1}) + \ldots + f_{d^*}(x_{i_{d^*}})$ for some univariate functions $f_1,\ldots,f_{d^*}$.
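
To give a feeling for the cost of the stepwise strategy, one may count how many coefficients have to be computed at each step $K$. The sketch below is only an illustration with hypothetical function names: it counts the frequencies $k\in\mathbb{Z}^d$ with $\|k\|_0 = K$ and $\|k\|_2\le m$, which is the number of empirical coefficients examined at step $K$.

```python
import math
from itertools import product


def coefficients_at_level(d, m, level):
    """Number of frequencies k in Z^d with exactly `level` nonzero entries and ||k||_2 <= m."""
    r = math.floor(m)
    nonzero_values = [v for v in range(-r, r + 1) if v != 0]
    # Number of admissible value patterns on one fixed support of size `level`.
    per_support = sum(1 for vals in product(nonzero_values, repeat=level)
                      if sum(v * v for v in vals) <= m * m)
    return math.comb(d, level) * per_support


def cumulative_cost(d, m, last_level):
    """Total number of coefficients computed if the procedure stops after step `last_level`."""
    return sum(coefficients_at_level(d, m, s) for s in range(1, last_level + 1))
```

When all relevant covariates are detected at step $K = 1$, as one expects for additive models, the cost is `coefficients_at_level(d, m, 1)`, which equals $2\lfloor m\rfloor d$ and is therefore linear in $d$; the exponential growth only appears if the procedure has to proceed to large values of $K$.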
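A Proof of Theorem 1

The empirical Fourier coefficients can be decomposed as follows:

$$\hat\theta_k = \tilde\theta_k + z_k,\qquad\text{where}\quad\tilde\theta_k = \frac{1}{n}\sum_{i=1}^n\frac{\varphi_k(X_i)}{g(X_i)}f(X_i)\quad\text{and}\quad z_k = \frac{\sigma}{n}\sum_{i=1}^n\frac{\varphi_k(X_i)}{g(X_i)}\varepsilon_i. \tag{12}$$

If, for a multi-index $k$, $\theta_k = 0$, then the corresponding empirical Fourier coefficient will be close to zero
with high probability. To show this, let us first look at what happens with the $z_k$'s. We have, for every real
number $x$,

$$P\big(|z_k| > x\,\big|\,X_1,\ldots,X_n\big)\le\exp\Big(-\frac{x^2}{2s_k^2}\Big),\qquad\forall k\in S_{m,d^*},$$

with

$$s_k^2 = \frac{\sigma^2}{n^2}\sum_{i=1}^n\frac{\varphi_k(X_i)^2}{g(X_i)^2}\le\frac{2\sigma^2}{n g_{\min}^2}.$$

Therefore, for every $k\in S_{m,d^*}$, it holds that $P(|z_k| > x\,|\,X_1,\ldots,X_n)\le\exp(-n g_{\min}^2 x^2/4\sigma^2)$. This entails
that by setting $\lambda_1 = (8\sigma^2 d^*\log(6md)/n g_{\min}^2)^{1/2}$ and by using the inequalities

$$\mathrm{Card}(S_{m,d^*})\le\sum_{i=0}^{d^*}\binom{d}{i}(2m)^i\le(2m)^{d^*}\sum_{i=0}^{d^*}\frac{d^i}{i!}\le 3(2md)^{d^*}\le(6md)^{d^*},$$

we get

$$P\Big(\max_{k\in S_{m,d^*}}|z_k| > \lambda_1\,\Big|\,X_1,\ldots,X_n\Big)\le\sum_{k\in S_{m,d^*}} P\big(|z_k| > \lambda_1\,\big|\,X_1,\ldots,X_n\big)\le\mathrm{Card}(S_{m,d^*})\,e^{-n g_{\min}^2\lambda_1^2/4\sigma^2}\le(6md)^{-d^*}.$$

Next, we use a concentration inequality for controlling large deviations of the $\tilde\theta_k$'s from the $\theta_k$'s. Recall that in
view of the definition $\tilde\theta_k = \frac{1}{n}\sum_{i=1}^n\frac{\varphi_k(X_i)}{g(X_i)}f(X_i)$, we have $E(\tilde\theta_k) = \theta_k$. By virtue of the boundedness of $f$, it
holds that $\big|\frac{\varphi_k(X_i)}{g(X_i)}f(X_i)\big|\le\sqrt{2}L_\infty/g_{\min}$. Furthermore, the bound $V = \mathrm{Var}\big(\frac{\varphi_k(X_i)}{g(X_i)}f(X_i)\big)\le\int_{[0,1]^d}\frac{\varphi_k(x)^2 f(x)^2}{g(x)}\,dx\le
2L_2^2/g_{\min}^2$ combined with Bernstein's inequality yields

$$P\big(|\tilde\theta_k-\theta_k| > t\big)\le 2\exp\Big(-\frac{nt^2}{2(V+t\sqrt{2}L_\infty/3g_{\min})}\Big)\le 2\exp\Big(-\frac{nt^2 g_{\min}^2}{4L_2^2+tL_\infty g_{\min}}\Big),\qquad\forall t > 0.$$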



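Let us define $\lambda_2 = 4L_2\big(\frac{d^*\log(6md)}{n g_{\min}^2}\big)^{1/2}$. Then,

$$P\big(|\tilde\theta_k-\theta_k| > \lambda_2\big)\le 2\exp\Big(-\frac{16L_2^2 d^*\log(6md)}{4L_2^2+\lambda_2 L_\infty g_{\min}}\Big).$$

The first inequality in condition (4) implies that the denominator in the exponential is not larger than
$8L_2^2$. Hence,

$$P\Big(\max_{k\in S_{m,d^*}}|\tilde\theta_k-\theta_k| > \lambda_2\Big)\le 2/(6md)^{d^*}.$$

Let $\mathcal{A}_1 = \{\max_{k\in S_{m,d^*}}|z_k|\le\lambda_1\}$ and $\mathcal{A}_2 = \{\max_{k\in S_{m,d^*}}|\tilde\theta_k-\theta_k|\le\lambda_2\}$. One easily checks that

$$P(\hat J\not\subset J)\le P(\mathcal{A}_1^c)+P(\mathcal{A}_2^c)\le 3/(6md)^{d^*}.$$

As for the converse inclusion, we have

$$P(J\not\subset\hat J)\le P\Big(\exists j\in J\ \text{s.t.}\max_{k\in S_{m,d^*}:\,k_j\ne 0}|\hat\theta_k|\le\lambda\Big)\le\mathbf{1}\Big(\exists j\in J\ \text{s.t.}\max_{k\in S_{m,d^*}:\,k_j\ne 0}|\theta_k|\le 2\lambda\Big)+P(\mathcal{A}_1^c)+P(\mathcal{A}_2^c).$$

We show now that the first term in the last line is equal to zero. If this was not the case, then for some
value $j_0$ we would have $Q_{j_0}\ge\kappa$ and $|\theta_k|\le 2\lambda$ for all $k\in S_{m,d^*}$ such that $k_{j_0}\ne 0$. This would imply that

$$Q_{j_0,m,d^*} = \sum_{k\in S_{m,d^*}:\,k_{j_0}\ne 0}\theta_k^2\le 4\lambda^2 N(d^*,m^2/d^*).$$

On the other hand,

$$Q_{j_0}-Q_{j_0,m,d^*}\le\sum_{\|k\|_2\ge m}\theta_k^2\le\sum_{\|k\|_\infty\ge m/\sqrt{d^*}}\theta_k^2.$$

Remark now that the choice of the truncation parameter $m$ proposed in the statement of the theorem
implies that $Q_{j_0}-Q_{j_0,m,d^*}\le\kappa/2$. Combining these estimates, we get $Q_{j_0} < \frac{\kappa}{2}+4\lambda^2 N(d^*,m^2/d^*)$,
which is impossible since $Q_{j_0}\ge\kappa$.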



B Proof of Proposition 1

Proof of the first assertion. This proof can be found in Mazo and Odlyzko (1990); we repeat here the
arguments therein for the sake of keeping the paper self-contained. Recall that $N_1(d^*,\gamma)$ admits an
integral representation with the integrand:

$$\frac{h(z)^{d^*}}{z^{\gamma d^*}}\,\frac{1}{z(1-z)} = \frac{1}{z(1-z)}\exp\big[d^*\,l_\gamma(z)\big].$$

For any real number $y > 0$, we define $\phi(y) = e^{-y}h'(e^{-y})/h(e^{-y}) = \sum_{k=-\infty}^{\infty}k^2 e^{-yk^2}\big/\sum_{k=-\infty}^{\infty}e^{-yk^2}$ in such
a way that

$$l_\gamma'(e^{-y}) = e^{y}\big(\phi(y)-\gamma\big).$$

By virtue of the Cauchy-Schwarz inequality, it holds that

$$\sum_{k}k^4 e^{-yk^2}\sum_{k}e^{-yk^2} > \Big(\sum_{k}k^2 e^{-yk^2}\Big)^2,\qquad\forall y\in(0,\infty),$$

implying that $\phi'(y) < 0$ for all $y\in(0,\infty)$, i.e., $\phi$ is strictly decreasing. Furthermore, $\phi$ is obviously
continuous with $\lim_{y\to 0}\phi(y) = +\infty$ and $\lim_{y\to\infty}\phi(y) = 0$. These properties imply the existence and the
uniqueness of $y_\gamma\in(0,\infty)$ such that $\phi(y_\gamma) = \gamma$. Furthermore, as the inverse of a decreasing function, the
function $\gamma\mapsto y_\gamma$ is decreasing as well. We set $z_\gamma = e^{-y_\gamma}$ so that $\gamma\mapsto z_\gamma$ is increasing.

We also have

$$l_\gamma''(z_\gamma) = \frac{h''h-(h')^2}{h^2}(z_\gamma)+\frac{\gamma}{z_\gamma^2} = -\frac{\phi'(y_\gamma)}{z_\gamma^2} > 0.$$

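Proof of the second assertion. We apply the saddle-point method to the integral representing $N_1$, see,
e.g., Chapter IX in Dieudonné (1968). It holds that

$$N_1(d^*,\gamma) = \frac{1}{2\pi i}\oint_{|z|=z_\gamma}\frac{h(z)^{d^*}}{z^{\gamma d^*}}\,\frac{dz}{z(1-z)} = \frac{1}{2\pi i}\oint_{|z|=z_\gamma}\{z(1-z)\}^{-1}e^{d^* l_\gamma(z)}\,dz. \tag{13}$$

The first assertion of the proposition provided us with a real number $z_\gamma$ such that $l_\gamma'(z_\gamma) = 0$ and $l_\gamma''(z_\gamma) > 0$.
The tangent to the steepest descent curve at $z_\gamma$ is vertical. The path we choose for integration is the
circle with center 0 and radius $z_\gamma$. As this circle and the steepest descent curve have the same tangent
at $z_\gamma$, applying formula (1.8.1) of Dieudonné (1968) (with $\alpha = 0$ since $l_\gamma''(z_\gamma)$ is real and positive), we get
that

$$\frac{1}{2\pi i}\oint_{|z|=z_\gamma}\{z(1-z)\}^{-1}e^{d^* l_\gamma(z)}\,dz = \frac{1}{2\pi i}\sqrt{\frac{2\pi}{d^* l_\gamma''(z_\gamma)}}\;e^{i\pi/2}\{z_\gamma(1-z_\gamma)\}^{-1}e^{d^* l_\gamma(z_\gamma)}\big(1+o(1)\big)$$

when $d^*\to\infty$, as soon as the condition$^1$ $\Re[l_\gamma(z)-l_\gamma(z_\gamma)]\le-\mu$ is satisfied for some $\mu > 0$ and for any $z$
belonging to the circle $|z| = |z_\gamma|$ and lying not too close to $z_\gamma$. To check that this is indeed the case, we
remark that $\Re[l_\gamma(z)] = \log\big|\frac{h(z)}{z^\gamma}\big|$. Hence, if $z = z_\gamma e^{i\omega}$ with $\omega\in[\omega_0,2\pi-\omega_0]$ for some $\omega_0\in\,]0,\pi[$, then

$$\Big|\frac{h(z)}{z^\gamma}\Big| = \frac{|1+2z+2\sum_{k>1}z^{k^2}|}{z_\gamma^\gamma}\le\frac{|1+z|+|z|+2\sum_{k>1}z_\gamma^{k^2}}{z_\gamma^\gamma}\le\frac{|1+e^{i\omega_0}z_\gamma|+z_\gamma+2\sum_{k>1}z_\gamma^{k^2}}{z_\gamma^\gamma}.$$

Therefore $\Re[l_\gamma(z)]-\Re[l_\gamma(z_\gamma)]\le-\mu$ with $\mu = \log\Big(\frac{1+2z_\gamma+2\sum_{k>1}z_\gamma^{k^2}}{|1+e^{i\omega_0}z_\gamma|+z_\gamma+2\sum_{k>1}z_\gamma^{k^2}}\Big) > 0$. This completes the proof for
the term $N_1(d^*,\gamma)$. The term $N_2(d^*,\gamma)$ can be dealt with in the same way.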



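$^1$ $\Re u$ stands for the real part of the complex number $u$.



C Proof of Lemma 2

Let $\phi(\cdot)$ be the density of $\mathcal{N}(0,1)$ and let

$$p_f(y) = \prod_{i=1}^n\phi\big(y_i-f(X_i)\big).$$

Since the errors $\varepsilon_i$ are Gaussian, the posterior probabilities $P_0$ and $P_1$ are absolutely continuous w.r.t.
the Lebesgue measure on $\mathbb{R}^n$ and admit the densities

$$p_0(y) = \prod_{i=1}^n\phi(y_i),\qquad\text{and}\qquad p_1(y) = E_{f\sim\mu_1}[p_f(y)],\qquad\forall y\in\mathbb{R}^n.$$

Simple algebra yields:

$$p_f(y) = C_f\prod_{i=1}^n\exp\{y_i f(X_i)\}\,\phi(y_i),$$

where $C_f = \prod_{i=1}^n\exp\{-f(X_i)^2/2\}$. Thus,

$$\frac{p_1}{p_0}(y) = E_{f\sim\mu_1}\Big[C_f\prod_{i=1}^n\exp\{y_i f(X_i)\}\Big].$$

Therefore,

$$\int_{\mathbb{R}^n}\Big(\frac{p_1}{p_0}(y)\Big)^2 p_0(y)\,dy = E_{(f,f')\sim\mu_1\otimes\mu_1}\Big[C_f C_{f'}\int_{\mathbb{R}^n}\prod_{i=1}^n\exp\{y_i(f+f')(X_i)\}\,\phi(y_i)\,dy\Big]$$

$$= E_{(f,f')\sim\mu_1\otimes\mu_1}\Big[C_f C_{f'}\prod_{i=1}^n\exp\Big(\frac{1}{2}(f+f')^2(X_i)\Big)\Big]$$

$$= E_{(f,f')\sim\mu_1\otimes\mu_1}\Big[\exp\Big(\sum_{i=1}^n f(X_i)f'(X_i)\Big)\Big]$$

$$= 2^{-2|S|}\sum_{\omega,\omega'\in\{\pm 1\}^S}\ \prod_{k,k'\in S}\exp\big(A^2\omega_k\omega'_{k'}b_{kk'}\big),$$

where $b_{kk'} = \sum_{i=1}^n\varphi_k(X_i)\varphi_{k'}(X_i)$ for all $k,k'\in S$. Note that $0\le b_{kk}\le 2n$ and $|b_{kk'}|\le n\epsilon$ for all
$k,k'\in S$ such that $k'\ne k$. Now, on the one hand, for a fixed pair $(\omega,\omega')$, we have

$$\prod_{k\ne k'}\exp\big(A^2\omega_k\omega'_{k'}b_{kk'}\big)\le\exp\big(|S|^2A^2n\epsilon\big).$$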
On the other hand, if we are given a sequence of numbers [bkk) indexed by S, we have 
From these remarks it results that 

•d 



(y)) Vo(yMy < exp (4|s|^*.^{i + 



and the claim of the lemma follows. 



D Proof of Proposition 2

Let $M = \binom{d}{d^*}$ and let $\{f_0,f_1,\ldots,f_M\}$ be a set included in $\Sigma_L$. Let $I_1,\ldots,I_M$ be all the subsets of $\{1,\ldots,d\}$
containing exactly $d^*$ elements somehow enumerated. Let us set $f_0\equiv 0$ and define $f_\ell$, for $\ell\ne 0$, by its
Fourier coefficients $\{\theta_k^\ell : k\in\mathbb{Z}^d\}$ as follows:

$$\theta_k^\ell = \begin{cases}1, & k = (k_1,\ldots,k_d) = (\mathbf{1}_{1\in I_\ell},\ldots,\mathbf{1}_{d\in I_\ell}),\\ 0, & \text{otherwise}.\end{cases}$$

Obviously, all the functions $f_\ell$ belong to $\Sigma_L$ and, moreover, each $f_\ell$ has $I_\ell$ as sparsity pattern. One easily
checks that our choice of $f_\ell$ implies $\mathcal{K}(P_{f_\ell},P_{f_0}) = n\|f_\ell-f_0\|_2^2 = n$. Therefore, if $\alpha\log M = \alpha\log\binom{d}{d^*}\ge n$,
the desired inequality is satisfied. To conclude, it suffices to note that $\log\binom{d}{d^*}$ is larger than or equal to
$d^*\log(d/d^*) = d^*(\log d-\log d^*)$.
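
The elementary bound $\log\binom{d}{d^*}\ge d^*\log(d/d^*)$ used in the last step can be checked numerically. The short sketch below uses hypothetical function names and is only an illustration: it evaluates both sides and returns the sample size threshold $\alpha d^*\log(d/d^*)$ below which the condition $\alpha\log M\ge n$ of the proof is guaranteed to hold.

```python
import math


def log_binomial(d, d_star):
    """Natural logarithm of the binomial coefficient C(d, d_star)."""
    return math.log(math.comb(d, d_star))


def sample_size_threshold(d, d_star, alpha):
    """Value alpha * d_star * log(d / d_star); any n below it satisfies alpha * log M >= n."""
    return alpha * d_star * math.log(d / d_star)
```

For example, comparing `log_binomial(1000, 20)` with `20 * math.log(1000 / 20)` shows that the lower bound is not tight, which is consistent with the remark that the constants obtained in this way are not sharp.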